AAI1001 Team 7 Project Proposal

Authors

Guo Zi Qiang Robin

Chew Tze Han

Cheong Wai Hong Jared

Akram

Gregory Tan

1 Project Proposal

Please ensure that you have installed these packages in your R environment before running the code chunks below.

# Load necessary libraries
library(tidyverse) # for data manipulation and visualization
library(metR) # for contour labels
library(viridis) # for color scales
library(ggpp) # for position_dodgenudge 
library(ggrepel) # for text repelling
library(directlabels) # for categorical text labeling
library(ggforce) # for enclosing shapes
library(shiny) # for interactiveness in ggplots
library(janitor) # for cleaning 
library(ggplot2) # for plotting
library(plotly) # for interactive plots

2 Introduction

Using various packages in R, we will create a poster that thoughtfully displays the socioeconomic factors that influence fertility/birth rates in Singapore. To do so we will be using fertility rate data sourced from SingStat as well as labour participation and marital status data from data.gov.sg. To note that data for 1995, 2000 and 2005 are not available as the Comprehensive Labour Force Survey was not conducted in these years due to the conduct of the Population Census 2000, General Household Surveys 1995 and 2005 by the Singapore Department of Statistics.

2.1 Analysis of Original Visualisations

The original visualisations from the Straits Times are shown below,

Total fertility rate from 2019 to 2023

Visualisations adapted from Straits Times: Singapore’s total fertility rate hits record low in 2023, falls below 1 for first time

The original visualisations focused on two variables: Singapore’s total fertility rate (quantitative) and years (quantitative) from 2019 to 2023. While these visualisations provide a clear overview of the declining fertility trend, they lack depth in exploring the underlying socioeconomic factors that contribute to this trend.

In a more recent article, Straits Times: Why the fertility rate doesn’t capture socio-economic or cultural trends, it critiques the visualizations represented, stating that the first article statistically misrepresented the data to show that single women are having less kids but not accounting for other socioeconomic factors that led to the decline in fertility rate.

Key Weaknesses:

  1. The visualization implies causality (e.g., being single = lower fertility) without data to back it up.

  2. It does not account for other socioeconomic variables such as trends in workforce participation.

  3. Lacks clarity in age-group breakdowns and trends across time.

Strengths:

  1. Clean and clear formatting.

  2. Good headline-grabbing summary.

  3. Uses official government data.

2.2 Proposed Improvements

These are the improvements that our group has identified:

  1. Interactivity: To show the difference in fertility trends between women in and out of the workforce, by age band and marital status. This will allow users to explore the data in a more engaging way and draw their own conclusions.

  2. Age band: To compare fertility rates across different age groups, allowing for a more nuanced understanding of how age impacts fertility.

  3. Marital status: Shows marriage trends over the years and how they relate to fertility rates, providing insights into the relationship between marital status and fertility.

  4. Female Labour Participation: To show difference between labour force population

The final visualisation that we plan to create will have 5 variables:

  1. Year: time series x-axis variable

  2. Count (in thousands): quantitative y-axis variable that accommodates the unit of measurement from our datasets. Count will represent Births Per Thousand Females and population of Females in the labour force (in thousands)

  3. marital status: categorical variable that will be differentiated by colour and appear as geom_col. There will be 3 marital statuses (single, married, widowed/divorced) that will be represented as separate bars in the visualisation

  4. age band: categorical variable that will be differentiated by colour and appear as geom_line. There will be 7 age bands (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49) and total fertility rate that will be represented as separate lines in the visualisation

  5. labour participation: categorical variable that is switchable to show separate faceted graphs

We will also consider the additions of other variables such as yearly housing prices or other factors that may affect fertility rate.

Thus, a low fidelity mockup of the final visualisation is shown below:

Low fidelity mockup of final visualisation

3 Planned work distribution for each team member

Team Member Tasks
Guo Zi Qiang Robin Data cleaning, data engineering, and visualisation
Chew Tze Han Data cleaning, data engineering, and visualisation
Cheong Wai Hong Jared Data cleaning, data engineering, and visualisation
Akram Data cleaning, data engineering, and visualisation
Gregory Tan Data cleaning, data engineering, and visualisation

4 Data Engineering

The datasets will be loaded as 3 tibbles:

Dataset Description
fertility Includes fertility rate data; contains metadata which needs to be skipped
work Includes data on males and females in the labour force and marital status
not_working Includes data on males and females out of the labour force and marital status

4.1 Data Loading

fertility <- read_csv(
  "datasets/ResidentFertilityRate.csv",
  skip = 9,
  n_max = 17,
  show_col_types = FALSE
)


work <- read_csv("datasets/ResidentLabourForceAged15YearsandOverbyMaritalStatusAgeandSex.csv", show_col_types = FALSE)
not_working <- read_csv("datasets/ResidentsOutsidetheLabourForceAged15YearsandOverbyMaritalStatusAgeandSex.csv", show_col_types = FALSE)

4.2 Cleaning

The tibbles “fertility”, “work”, and “not_working” will be cleaned and engineered for our interactive visualisation of fertility rates.

4.2.1 fertility

The following steps will be taken to clean and reshape “fertility”:

  • “fertility” tibble contains “na” strings which are not actually NA values, these points will need to be converted to NA values

  • fertility rate data from SingStat is in wide format with years as the columns, we will pivot long for year-wise plots

  • fertility rate data goes up till 2024, whereas the labour force data only goes up till 2022, we will filter the fertility rate data to only include years after 1990 and up till 2022

  • standardise age banding of fertility rate dataset to be consistent with labour force data. For example, “15-19” instead of “15 - 19 Years (Per Thousand Females)’

  • filtered to include age specific fertility rates and the total fertility rate by year

  • introduce Unit of Measurement (uom) column to indicate scaling for total fertility rates and age banded fertility rates

# Clean and reshape fertility data
fertility_clean <- fertility |>
  clean_names() |>
  rename(measure = data_series) |>
  # Convert all columns to character to handle mixed types
  mutate(across(-measure, as.character)) |>
  pivot_longer(
    cols = -measure,
    names_to = "year",
    values_to = "value"
  ) |>
  mutate(
    year = as.numeric(str_remove(year, "^x")),  # Remove leading "x"
    measure = str_trim(measure),
    value = ifelse(tolower(value) == "na", NA, value),  # Handle "na" strings
    value = as.numeric(value)
  ) |>
  # Extract age bands and filter for only age-specific rates
  mutate(
    age_band = case_when(
      measure == "Total Fertility Rate (TFR) (Per Female)" ~ "All",
      str_detect(measure, "15 - 19") ~ "15-19",
      str_detect(measure, "20 - 24") ~ "20-24",
      str_detect(measure, "25 - 29") ~ "25-29",
      str_detect(measure, "30 - 34") ~ "30-34",
      str_detect(measure, "35 - 39") ~ "35-39",
      str_detect(measure, "40 - 44") ~ "40-44",
      str_detect(measure, "45 - 49") ~ "45-49",
      TRUE ~ NA_character_
    )
  ) |>
  filter(!is.na(age_band)) |>  # Keep only age band rows
  mutate(
    uom = case_when(
      age_band == "All" ~ "per female",
      TRUE              ~ "per thousand females")
  ) |>
  filter(year > 1990 & year <= 2020) |>
  select(year, age_band, fertility_rate = value, uom)

4.2.2 not_working

The following steps will be taken to clean and reshape “not_working”:

  • standardise column names to the 7 (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49) age bands to be consistent with fertility and remove extra bandings

  • for labour datasets, divide labour_force values by 1000 to align with count (in thousands) y-axis variable

  • some outside_labour_force values are “-” which are not valid numerics, convert these to NA

  • rename age column to age_band to match fertility

  • aggregate age bands to introduce All to represent population outside labour force by year and marital status only, this is so that we can introduce interactivity with total fertility rate and fertility rates across age bands

# Clean and reshape not_working data
not_working_clean <- not_working |>
  clean_names() |>
  # Filter for the 7 age bands only
  filter(age %in% c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")) |>
  # Replace "-" with NA and convert to numeric
  mutate(
    outside_labour_force = na_if(outside_labour_force, "-"),
    outside_labour_force = as.numeric(outside_labour_force),
    outside_labour_force = outside_labour_force / 1000,  # Convert to thousands
    age_band = age  # Rename for consistency with fertility_clean
  ) |>
  select(year, sex, marital_status, age_band, outside_labour_force)

# Create aggregated "All" row by year, sex, and marital_status
not_working_all <- not_working_clean |>
  group_by(year, sex, marital_status) |>
  summarise(
    age_band = "All",
    outside_labour_force = sum(outside_labour_force, na.rm = TRUE),
    .groups = "drop"
  )

# Combine original cleaned data with the aggregated "All" row
not_working_clean <- bind_rows(not_working_clean, not_working_all)

4.2.3 work

“work” tibble is cleaned in a similar way to “not_working”.

# Clean and reshape work data
work_clean <- work |>
  clean_names() |>
  # Filter to only the 7 standard age bands
  filter(age %in% c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")) |>
  # Replace "-" with NA and convert labour_force to numeric
  mutate(
    labour_force = na_if(labour_force, "-"),
    labour_force = as.numeric(labour_force),
    labour_force = labour_force / 1000,  # Convert to thousands
    age_band = age  # Rename to match other datasets
  ) |>
  select(year, sex, marital_status, age_band, labour_force)

# Create aggregated "All" row by year, sex, and marital_status
work_all <- work_clean |>
  group_by(year, sex, marital_status) |>
  summarise(
    age_band = "All",
    labour_force = sum(labour_force, na.rm = TRUE),
    .groups = "drop"
  )

# Combine original cleaned data with the aggregated "All" row
work_clean <- bind_rows(work_clean, work_all)

4.3 Outliers

Using the IQR method to identify outliers in the datasets, we will plot a time series line graph to visualise the points where the outliers occur.

# IQR-based outlier detection function
detect_outliers_iqr <- function(df, value_col, group_cols) {
  df |>
    group_by(across(all_of(group_cols))) |>
    mutate(
      Q1 = quantile(.data[[value_col]], 0.25, na.rm = TRUE),
      Q3 = quantile(.data[[value_col]], 0.75, na.rm = TRUE),
      IQR = Q3 - Q1,
      lower_bound = Q1 - 1.5 * IQR,
      upper_bound = Q3 + 1.5 * IQR,
      is_outlier = .data[[value_col]] < lower_bound | .data[[value_col]] > upper_bound
    ) |>
    ungroup()
}


# Apply to each dataset and exclude All (which is all ages)
fertility_outliers_flagged <- fertility_clean |>
  filter(age_band != "All") |>
  detect_outliers_iqr("fertility_rate", "age_band")

work_outliers_flagged <- work_clean |>
  filter(age_band != "All") |>
  detect_outliers_iqr("labour_force", c("age_band", "marital_status", "sex"))

not_working_outliers_flagged <- not_working_clean |>
  filter(age_band != "All") |>
  detect_outliers_iqr("outside_labour_force", c("age_band", "marital_status", "sex"))
# Fertility rate outliers (excluding Total Fertility Rate)
ggplot(
  filter(fertility_outliers_flagged, age_band != "All"),
  aes(x = year, y = fertility_rate, color = age_band, group = age_band)
) +
  geom_line(linewidth = 0.8) +
  geom_point(
    data = filter(fertility_outliers_flagged, is_outlier & age_band != "All"),
    color = "red", size = 2, shape = 21, fill = "white"
  ) +
  facet_wrap(~age_band, scales = "free_y") +
  labs(
    title = "Fertility Rates by Age Band with Outliers (Excluding Total Fertility Rate)",
    y = "Fertility Rate", x = "Year", color = "Age Band"
  ) +
  theme_minimal()

# Labour force (working)
ggplot(work_outliers_flagged, aes(x = year, y = labour_force, color = age_band, group = interaction(age_band, marital_status))) +
  geom_line(linewidth = 0.8) +
  geom_point(
    data = filter(work_outliers_flagged, is_outlier),
    shape = 21, fill = "white", color = "red", size = 2,
    position = position_jitter(width = 0.5, height = 0.2)
  ) +
  facet_grid(sex ~ marital_status, scales = "free_y") +
  labs(
    title = "Labour Force by Age Band, Marital Status and Sex",
    y = "Labour Force (in thousands)", x = "Year", color = "Age Band"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Warning: Removed 56 rows containing missing values or values outside the scale range
(`geom_line()`).

# Outside labour force (not working)
ggplot(not_working_outliers_flagged, aes(x = year, y = outside_labour_force, color = age_band, group = interaction(age_band, marital_status))) +
  geom_line(linewidth = 0.8) +
  geom_point(
    data = filter(not_working_outliers_flagged, is_outlier),
    shape = 21, fill = "white", color = "red", size = 2,
    position = position_jitter(width = 0.5, height = 0.2)
  ) +
  facet_grid(sex ~ marital_status, scales = "free_y") +
  labs(
    title = "Outside Labour Force by Age Band, Marital Status and Sex",
    y = "Outside Labour Force (in thousands)", x = "Year", color = "Age Band"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Warning: Removed 87 rows containing missing values or values outside the scale range
(`geom_line()`).

The graphs generally show consistency with reported fertility rates and labour force participation trends in Singapore.

5 Joining datasets

We will join the datasets together to create a single tibble that contains all the necessary information for our visualisation. The joined tibble will contain the following columns:

  • year: from 1990 to 2022

  • age_band: Age bands and “All” which is for total fertility rate

  • marital_status: Marital status of the data point

  • fertility_rate: Fertility rate by age band (per thousand females) and total fertility rate (per female)

  • uom: Fertility rate unit of measurement

  • labour_status: Labour status of the data point, either “labour_force” or “outside_labour_force”

  • count: Number of females either in workforce or outside workforce (in thousands)

# Filter labour data to only include females
work_clean_female <- work_clean |> 
  filter(sex == "female") |> 
  select(-sex)

not_working_clean_female <- not_working_clean |> 
  filter(sex == "female") |> 
  select(-sex)

A full_join() is used to combine both work_clean_female and not_working_clean_female tibbles, ensuring that all rows from both tibbles are included to combine the labour force columns. The join is done on the year, marital_status, and age_band columns, common dimensions to both tibbles to prevent any data loss.

# Combine female labour and not working into one tibble
labour_status_female <- full_join(
  work_clean_female, 
  not_working_clean_female, 
  by = c("year", "marital_status", "age_band")
)

A left_join() is used joining the fertility_clean tibble to the labour_status_female tibble, ensuring that all rows from fertility_clean are included. This will allow us to combine and be able to associate fertility rates with labour force participation data.

# Join to fertility
fertility_labour_joined_female <- fertility_clean |>
  left_join(labour_status_female, by = c("year", "age_band"))

Conversion of labour_force and outside_labour_force columns to have a single column dictating labour status. years that do not have corresponding labour force data (1995, 2000, 2005) are filtered out.

# Pivot longer and ensure no sex columns remain
final <- fertility_labour_joined_female |>
  pivot_longer(
    cols = c("labour_force", "outside_labour_force"),
    names_to = "labour_status",
    values_to = "count"
  ) |>
  group_by(year) |>
  filter(!all(is.na(count))) |> # to deal with missing years in labour force data
  ungroup() |>
  mutate(count = replace_na(count, 0))

  

final |> 
  slice_sample(n = 5)
# A tibble: 5 × 7
   year age_band fertility_rate uom           marital_status labour_status count
  <dbl> <chr>             <dbl> <chr>         <chr>          <chr>         <dbl>
1  2011 30-34             89.5  per thousand… married        outside_labo…  26.5
2  1996 40-44              7    per thousand… married        outside_labo…  57.1
3  2017 All                1.16 per female    married        labour_force  417. 
4  1996 40-44              7    per thousand… widowed_divor… outside_labo…   1  
5  2016 40-44              8.8  per thousand… single         labour_force   22.3

6 Data Visualisation

this uses ggplot which is what we want but uses shiny which does not render in html

# UI
ui <- fluidPage(
  titlePanel("Fertility and Labour Participation Over Time"),
  sidebarLayout(
    sidebarPanel(
      selectInput("labour_status", "Select Labour Status:",
                  choices = unique(final$labour_status),
                  selected = unique(final$labour_status)[1])
    ),
    mainPanel(
      plotlyOutput("interactive_plot")
    )
  )
)

# Server
server <- function(input, output) {
  output$interactive_plot <- renderPlotly({
    filtered_data <- final |>
      filter(age_band != "All", labour_status == input$labour_status)

    p <- ggplot(filtered_data, aes(x = year)) +
      geom_col(aes(y = count, fill = marital_status), position = "dodge", alpha = 0.7) +
      geom_line(aes(y = fertility_rate, color = age_band, group = age_band), linewidth = 0.5) +
      scale_fill_brewer(palette = "Set2") +
      scale_color_viridis_d(option = "D") +
      labs(
        title = paste("Fertility Rate and", gsub("_", " ", input$labour_status)),
        x = "Year",
        y = "Labour Count (Bar) & Fertility Rate ",
        fill = "Marital Status",
        color = "Age Band"
      ) +
      theme_minimal(base_size = 12) +
      theme(
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        plot.title = element_text(face = "bold", hjust = 0.5)
      )

    ggplotly(p, tooltip = c("year", "count", "fertility_rate", "marital_status", "age_band")) |>
      layout(legend = list(orientation = "h", x = 0.1, y = -0.2))
  })
}

# Run the app
shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

this does not use ggplot and uses plotly which is not what we were taught but it renders in html

# Prepare fertility data
fertility_data <- final |>
  filter(age_band != "All") |>
  select(year, age_band, fertility_rate) |>
  distinct()

labour_statuses <- unique(final$labour_status)
age_bands <- unique(fertility_data$age_band)
colors <- RColorBrewer::brewer.pal(length(age_bands), "Set2")

trace_list <- list()

for (i in seq_along(labour_statuses)) {
  status <- labour_statuses[i]
  final_status <- final |> filter(labour_status == status)

  # Bar traces by marital status
  count_data <- final_status |>
    group_by(year, marital_status) |>
    summarise(count = sum(count), .groups = "drop")

  for (m in unique(count_data$marital_status)) {
    sub_bar <- count_data |> filter(marital_status == m)

    trace_list[[length(trace_list) + 1]] <- list(
      x = sub_bar$year,
      y = sub_bar$count,
      type = "bar",
      name = m,
      marker = list(opacity = 0.7),
      visible = (i == 1),
      legendgroup = m,
      showlegend = TRUE  # ← ALWAYS show legend
    )
  }

  # Line traces by age_band
  for (k in seq_along(age_bands)) {
    a <- age_bands[k]
    sub_line <- fertility_data |> filter(age_band == a)

    trace_list[[length(trace_list) + 1]] <- list(
      x = sub_line$year,
      y = sub_line$fertility_rate * 1000,  # Scale for y2
      type = "scatter",
      mode = "lines",
      name = a,
      line = list(color = colors[k], width = 2),
      yaxis = "y2",
      visible = (i == 1),
      legendgroup = a,
      showlegend = TRUE  # ← ALWAYS show legend
    )
  }
}

# Create base plot
combined_plot <- plot_ly()
combined_plot$x$data <- trace_list

# Visibility logic
n_per_status <- length(trace_list) / length(labour_statuses)
visibility_matrix <- lapply(seq_along(labour_statuses), function(i) {
  vis <- rep(FALSE, length(trace_list))
  vis[((i - 1) * n_per_status + 1):(i * n_per_status)] <- TRUE
  vis
})

# Dropdown menu
dropdown <- list(
  list(
    buttons = lapply(seq_along(labour_statuses), function(i) {
      list(
        method = "restyle",
        args = list("visible", visibility_matrix[[i]]),
        label = tools::toTitleCase(gsub("_", " ", labour_statuses[i]))
      )
    }),
    direction = "down",
    x = 0.02,
    y = 1.15,
    showactive = TRUE
  )
)

# Final layout
combined_plot <- combined_plot |>
  layout(
    updatemenus = dropdown,
    barmode = "group",
    title = list(text = "<b>Fertility Rate and Labour/Marital Status Over Time</b>"),
    xaxis = list(title = "Year"),
    yaxis = list(title = "Labour Count (Bar)"),
    yaxis2 = list(
      overlaying = "y", side = "right",
      title = "Fertility Rate (Line)"
    ),
    legend = list(orientation = "h", y = -0.25),
    margin = list(t = 80)
  )

combined_plot
Warning: No trace type specified and no positional attributes specified
No trace type specified:
  Based on info supplied, a 'scatter' trace seems appropriate.
  Read more about this trace type -> https://plotly.com/r/reference/#scatter
No scatter mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

7 References

Data.gov.sg. (n.d.-a). data.gov.sg. https://staging.data.gov.sg/datasets?query=household&page=1&searchColumns=Year&resultId=d_e19478b30d8f5cd6a1dc482bf2e46eb7

Data.gov.sg. (n.d.-b). data.gov.sg. https://staging.data.gov.sg/datasets?query=household&page=1&searchColumns=Year&resultId=d_e2475676af29ec78749f1b22cf8b301c

MacroTrends. (n.d.-a). Singapore unemployment rate. MacroTrends. Retrieved July 5, 2025, from https://www.macrotrends.net/global-metrics/countries/sgp/singapore/unemployment-rate

MacroTrends. (n.d.-b). Singapore population. MacroTrends. Retrieved July 5, 2025, from https://www.macrotrends.net/global-metrics/countries/sgp/singapore/population

Singapore Department of Statistics. (n.d.). Population by age group, sex and type of locality, 2023 [Table M810091]. Singapore Department of Statistics. Retrieved July 5, 2025, from https://tablebuilder.singstat.gov.sg/table/TS/M810091

Tan, T. (2024a, March 11). Singapore’s total fertility rate hits record low in 2023, falls below 1 for first time. The Straits Times. https://www.straitstimes.com/singapore/politics/singapore-s-total-fertility-rate-hits-record-low-in-2023-falls-below-1-for-first-time

Tan, T. (2024b, June 30). Why the fertility rate doesn’t capture socio-economic or cultural trends. The Straits Times. https://www.straitstimes.com/singapore/why-the-fertility-rate-doesn-t-capture-socio-economic-or-cultural-trends